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Abstract 

We propose a probabilistic model for the parallel execution of Las 
Vegas algorithms, i.e., randomized algorithms whose runtime might 
vary from one execution to another, even with the same input. This 
model aims at predicting the parallel performances {i.e., speedups) 
by analysis the runtime distribution of the sequential runs of the 
algorithm. Then, we study in practice the case of a particular Las 
Vegas algorithm for combinatorial optimization, on three classical 
problems, and compare with an actual parallel implementation up 
to 256 cores. We show that the prediction can be quite accurate, 
matching the actual speedups very well up to 100 parallel cores 
and then with a deviation of about 20% up to 256 cores. 

Categories and Subject Descriptors D [1.3]: Parallel program- 
ming; F [1.2]: Parallelism and concurrency; G [1.6]: Stochastic 
programming; G [3]: Probabilistic algorithms (including Monte 
Carlo); G [2.1]: Combinatorial algorithms 

General Terms Theory, Algorithms, Performance. 

Keywords Las Vegas algorithms, Prediction, Parallel Speed-ups, 
Local Search, Statistical Modeling, Runtime Distributions. 

1. Introduction 

We will consider in this paper Las Vegas algorithms, introduced 
a few decades ago by (5), i.e. randomized algorithms whose run- 
time might vary from one execution to another, even with the same 
input. An important class of Las Vegas algorithms is the family 
of Stochastic Local Search methods [29]. They have been used in 
Combinatorial Optimization for finding optimal or near-optimal so- 
lutions for several decades 1 1 1, stemming from the pioneering work 
of Lin on the Traveling Salesman Problem [31 j. Theses methods 
are now widely used in combinatorial optimization to solve real- 
life problems when the search space is too large to be explored by 
complete search algorithm, such as Mixed Integer Programming or 
Constraint Solving, c.f. |26|. 

In the last years, several proposal for implementing local search 
algorithms on parallel computer have been proposed, the most pop- 
ular being to run several competing instances of the algorithms 
on different cores, with different initial conditions or parameters, 
and let the fastest process win over others. We thus have an algo- 
rithm with the minimal execution time among the launched pro- 
cesses. This lead to so-called independent multi-walk algorithms 
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in the local search community 1391 and portfolio algorithms in the 
SAT community (satisfiability of Boolean formula) [25]. This par- 
allelization scheme can of course be generalized to any Las Vegas 
algorithm. 

The goal of this paper is to study the parallel performances of 
Las Vegas algorithms under this independent multi-walk scheme, 
and to predict the performances of the parallel execution from the 
runtime distribution of the sequential runs of a given algorithm. We 
will confront these predictions with actual speedups obtained for a 
parallel implementation of a local search algorithm and show that 
the prediction can be quite accurate, matching the actual speedup 
very well up to 100 parallel cores and then with a deviation limited 
to about 20% up to 256 cores. 

The paper is organized as follows. Section 2 is devoted to 
present the definition of Las Vegas algorithms, their parallel multi- 
walk execution scheme, and the main idea for predicting the par- 
allel speedups. Section[3]will detail the probabilistic model of Las 
Vegas algorithms and their parallel execution scheme. Section [4] 
will present the example of local search algorithms for combinato- 
rial optimization, while Section [5] will detail the benchmark prob- 
lems and the sequential performances. Then, Section [6] will apply 
the general probabilistic model to the benchmark results and thus 
predict their parallel speedup, which will be compared to actual 
speedups of a parallel implementation in Section [7] A short con- 
clusion and future work end will the paper. 

2. Preliminaries 

2.1 Las Vegas Algorithms 

We borrow the following definition from [29], Chapter 4. 

Definition 1 (Las Vegas Algorithm). An algorithm A for a problem 
class If is a (generalized) Las Vegas algorithm if and only if it has 
the following properties: 

1. If for a given problem instance ty £ II, algorithm A terminates 
returning a solution s, s is guaranteed to be a correct solution 
of TV. 

2. For any given instance 7r £ IT, the run-time of A applied to ty is 
a random variable. 

This is a slight generalization of the classical definition, as it 
includes algorithms which are not guaranteed to return a solution. 

A large class of Las Vegas algorithms is the so-called family of 
metaheuristics, such as Simulated Annealing, Genetic Algorithms, 
Tabu Search, Swarm Optimization, Ant-Colony optimization, etc, 
which have been applied to different sets of problems ranging from 
resource allocation, scheduling, packing, layout design, frequency 
allocation, etc. 

2.2 Multi-walk Parallel Extension 

Parallel implementation of local search metaheuristics [26 30) has 
been studied since the early 1990s, when parallel machines started 
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to become widely available 1341 1391 . With the increasing avail- 
ability of PC clusters in the early 2000s, this domain became ac- 
tive again 141 1131 . Apart from domain-decomposition methods and 
population-based method (such as genetic algorithms), |39| dis- 
tinguishes between single-walk and multi-walk methods for Local 
Search. Single-walk methods consist in using parallelism inside a 
single search process, e.g., for parallelizing the exploration of the 
neighborhood (see for instance 1381 for such a method making use 
of GPUs for the parallel phase). Multi-walk methods (parallel ex- 
ecution of multi-start methods) consist in developing concurrent 
explorations of the search space, either independently or cooper- 
atively with some communication between concurrent processes. 
Sophisticated cooperative strategies for multi-walk methods can be 
devised by using solution pools 1141 . but require shared-memory or 
emulation of central memory in distributed clusters, thus impacting 
on performance. A key point is that a multi-walk scheme is easier to 
implement on parallel computers without shared memory and can 
lead, in theory at least, to linear speedups 1 39 1 . However this is only 
true under certain assumptions and we will see that we need to de- 
velop a more realistic model in order to cope with the performance 
actually observed in parallel executions. 

Let us now formally define a parallel multi-walk Las Vegas 
algorithm. 

Definition 2 (Multi-walk Las Vegas Algorithm). An algorithm 
A' for a problem class II is a (parallel) multi-walk Las Vegas 
algorithm if and only if it lias the following properties: 

1. It consists of n instances of a sequential Las Vegas algorithm A 
/or II, say A\, A n . 

2. If for a given problem instance ty G II, there exists at least one 
i G [1, n] such that Ai terminates, then let A m , m G [1. n], be 
the instance of A terminating with the minimal runtime and let 
s be the solution returned by A m , Then algorithm A' terminates 
in the same time as A m and returns solution s. 

3. If for a given problem instance n 6 II, all Ai,i G [1, n], do 
not terminate then A' does not terminate. 

2.3 How to Estimate Parallel Speedup ? 

The multi-walk parallel scheme is rather simple, yet it provides 
an interesting test-case to study how Las Vegas algorithms can 
scale-up in parallel. Indeed runtime will vary among the processes 
launched in parallel and the overall runtime will be that of the 
instance with minimal execution time (i.e. "long" runs are killed 
by "shorter" ones). The question is thus to quantify the (relative) 
notion of short and long runs and their probability distribution. This 
might gives us a key to quantify the expected parallel speed-up. 
Obviously, this can be observed from the sequential behavior of 
the algorithm, and more precisely from the proportion of long and 
short runs in the sequential runtime distribution. 

In the following, we propose a probabilistic model to quantify 
the expected speed-up of multi-walk Las Vegas algorithms. This 
makes it possible to give a general formula for the speed-up, de- 
pending on the sequential behavior of the algorithm. Our model is 
related to order statistics, a rather new domain of statistics 1151 , 
which is the statistics of sorted random draws. Indeed, explicit for- 
mulas have been given for several well-known distributions. Rely- 
ing on an approximation of the sequential distribution, we com- 
pute the average speed-up for the multi-walk extension. Experi- 
ments show that the prediction is quite good and opens the way 
for defining more accurate models and apply them to larger classes 
of algorithms. 

Previous works 1 39 1 studied the case of a particular distribution 
for the sequential algorithm, the exponential distribution. This case 
is ideal and the best possible, as it yields a linear speed-up. Our 
model makes it possible to approximate Las Vegas algorithms by 



other types of distribution, such as a shifted exponential distribu- 
tion or a lognormal distribution. In the last two cases the speed-up 
is no longer linear, but admits a finite limit when the number of pro- 
cessors tends toward infinity. We will see that it fits experimental 
data for some problems. 

3. Probabilistic Model 

Local Search algorithms are stochastic processes. They include 
several random components: choice of an initial configuration, 
choice of a move among several candidates, plateau mechanism, 
random restart, etc. In the following, we will consider the compu- 
tation time of an algorithm (whatever it is) as a random variable, 
and use elements of probability theory to study its multi-walk par- 
allel version. Notice that the computation time is not necessarily the 
cpu-time; it can also be the number of iterations performed during 
the execution of the algorithm. 

3.1 Min Distribution 

Consider a given algorithm on a given problem of a given size, 
say, the MAGIC-SQUARE 10 x 10. Depending on the result of 
some random components inside the algorithm, it may find a so- 
lution after iterations, 10 iterations, or 10 6 iterations. The num- 
ber of iterations of the algorithm is thus a discrete random variable, 
let's call it Y, with values in N. Y can be studied through its cu- 
mulative distribution, which is by definition, the function Ty s.t. 
Ty{x) = P[Y < a;], or by its distribution, which is by definition 
the derivative of Ty '■ fy = T'y ■ 

It is often more convenient to consider distributions with values 
in R because it makes calculations easier. For the same reason, 
although fy is defined in N, we will use its natural extension to 
E. The expectation of the computation is then defined as E[Y] = 
J °°tfv(t)dt 

Assume that the base algorithm is concurrently run in parallel 
on n cores. In other words, over each core the running process is 
a fork of the algorithm. The first process which finds a solution 
then kills all others, and the algorithm terminates. The i-th process 
corresponds to a draw of a random variable Xi, following distri- 
bution fy. The variables Xi are thus independently and identically 
distributed (i.i.d.). The computation time of the whole parallel pro- 
cess is also a random variable, let Z^ n \ with a distribution f z i n ) 
that depends both on n and on fy. Since all the Xi are i.i.d., the 
cumulative distribution T z (n) can be computed as follows: 



V[Z [n) < x) 

P[3i G {l...n},Xi < x\ 

l-P[Vi S {l...n},Xi > x] 



= 1 
= 1 



llnxi 



> X 



(I-Ty(x)T 



which leads to: 



fzM = (1-(1-JV 
= nfy(l-Ty) 



Thus, knowing the distribution for the base algorithm Y, one can 
calculate the distribution for Z^ n '. In the general case, the formula 
shows that the parallel algorithm favors short runs, by killing the 
slower processes. Thus, we can expect that the distribution of Z^ n ' 
moves toward the origin, and is more peaked. As an example, 
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Figure 1. Distribution of Z^"\ in the case where Y admits a 
gaussian distribution (cut onl" and renormalized). The blue curve 
is Y. The distributions of are in pink for n = 10, in yellow 
for n = 100 and in green for n = 1000. 




o.oo: - 



1000 



Figure 2. For an exponential distribution, here in blue with xq = 
100 and A = 1/1000, simulations of the distribution of Z (n) for 
n — 2 (pink), n — 4 (yellow) and n — 8 (green). 



Figure[TJshows this phenomenon when the base algorithm admits a 
gaussian distribution. 

3.2 Expectation and Speed-up 

The model described above gives the probability distribution of a 
parallelized version of any random algorithm. We can now calcu- 
late the expectation for the parallel process with the following rela- 
tion: 



E[Z 



tf z(n) (t)dt 



= n tf Y (t)(l-F Y (t)) n - x dt 
Jo 

Unfortunately, this does not lead to a general formula for 
E[Z' n ']. In the following, we will study it for different specific 
distributions. 

To measure the gain obtained by parallelizing the algorithm on 
n core, we will study the speed-up Q n defined as: 

G„ = E[Y]/E[Z (n) ] 

Again, no general formula can be computed and the expression 
of the speed-up will depend on the distribution of Y. 

However, it is worth noting that our computation of the speed- 
up is related to order statistics, see [15] for a detailed presentation. 
For instance, the first order statistics of a distribution is its minimal 
value, and the k order statistic is its fc th -smallest value. For 
predicting the speedup, we are indeed interested in computing the 
expectation of the distribution of the minimum draw. As the above 
formula suggests, this may lead to heavy calculations, but recent 
studies such as 1331 give explicit formulas for this quantity for 
several classical probability distributions. 

3.3 Case of an Exponential Distribution 

Assume that Y has a shifted exponential distribution, as it has been 
suggested by (2) 121 ■ 



fr(t) = 

Tr{t) = 
K[Y] = 



if t < x 

Xe -X(t-x ) ift>Xo 

if t < x 

l_ e -A(t-s ) ift>X 

x + 1/A 



Then the above formula can be symbolically computed by hand: 



/*(»)(«) = 







if t < xo 



iXe _ nH t_ xo) jft> 



.Co 



^W(')- \ l _ e -n\{t-x ) lft 



> x 



The intuitive observation of section [3~T] is easily seen on the 
expression of the parallel distribution, which has an initial value 
multiplied by n but an exponential factor decreasing n-times faster, 
as shown on the curves of Figure|2] 

And in this case, one can symbolically compute both the expec- 
tation and speed-up for Z^ n ' : 



E\Z 



Wi 



= n\ I te 

o 

1 



-n\(t — xq ) 



<it 



= x + 



ii X 



^0 + X 

X + ZT 



Figure[3]shows the evolution of the speed-up when the number 
of cores increases. With such a rather simple formula for the speed- 
up, it is worth studying what happens when the number of cores 
n tends to infinity. Depending on the chosen algorithm, xo may 
be null or not. If xo — 0, then the expectation tends to and 
the speed-up is equal to n. This case has already been studied 
by 1391 . For xo > 0, the speed-up admits a finite limit which 

is x ° x =1-1 ^r. Yet, this limit may be reached slowly, but 

depends on the value of xq and A: obviously, the closest xo is 
to zero and the higher it will be. Another interesting value is the 
coefficient of the tangent at the origin, which approximates the 
speed-up for a small number of cores. In case of an exponential, 
it is (xo * A + 1). The higher xo and A, the bigger is the speed-up 
at the beginning. In the following, we will see that, depending on 
the combinations of Xq and A, different behaviors can be observed. 

3.4 Case of a Lognormal Distribution 

Other distributions can be considered, depending on the behavior 
of the base algorithm. We will study the case of a lognormal 
distribution, which is the log of a gaussian distribution, because 
it will be shown in Section [6^2| that it fits one experiment. It has 
two parameters, the mean /i and the standard deviation a. In the 
same way as the shifted exponential, we shift the distribution so that 
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Figure 3. Predicted speed-up in case of an exponential distribu- Figure 5. Predicted speed-up in case of a lognormal distribution, 
tion, with xo = 100 and A = 1/1000, w.r.t. the number of cores. with xq = 0, fj, = 5 and a = 1, depending on the number of cores 

(on the abscissa). 




Figure 4. For a lognormal distribution, here in blue with xo = 0, 
/i = 5 and a = 1, simulations of the distribution of Z' n 'for n = 2 
(pink), n — 4 (yellow) and n — 8 (green). 



it starts at a given parameter xo. Formally, a (shifted) lognormal 
distribution is defined as: 



M*) 



7V(i) = 



l er f c( M^Mt£0); 



if £ < xo 

if t > xo 

if i < xo 
if i > Xq 



where erfc is the complementary error function defined by 

Figure Ul depicts lognormal distributions of Z^ n \ for several n. 
The computations for the distribution of Z^ n \ its expectation and 
the theoretical speed-up are quite complicated formulas. But 1331 
gives an explicit formula for all the moments of lognormal order 
statistics with only a numerical integration step, from which we 
can derive a computation of the speed-up (since the expectation 
of Z' n 'is the first order moment for the first order statistics). This 
allows us to draw the general shape of the speed-up, an example 
being given on Figure [5] Due to the numerical integration step, 
which requires numerical values for the number of cores n, we 
restrict the computation to integer values of n. This is a reasonable 
limitation as the number of cores is indeed an integer. 

4. Application to Local Search 

Since about a decade, the interest for the family of Local Search 
methods and Metaheuri sties for solving large combinatorial prob- 
lems has been growing and has attracted much attention from both 



the Operations Research and the Artificial Intelligence communi- 
ties for solving real-life problems 1 26 28, 30]. 

4.1 Local Search for Constraint Solving 

Local Search starts from a random configuration and tries to im- 
prove this configuration, little by little, through small changes in 
the values of the problem variables. Hence the term "local search" 
as, at each time step, only new configurations that are "neighbors" 
of the current configuration are explored. The definition of what 
constitutes a neighborhood will of course be problem-dependent, 
but basically it consists in changing the value of a few variables 
only (usually one or two). The advantage of Local Search meth- 
ods is that they will usually quickly converge towards a solution (if 
the optimality criterion and the notion of neighborhood are defined 
correctly...) and not exhaustively explore the entire search space. 

Applying Local Search to Constraint Satisfaction Problems 
(CSP) has been attracting some interest since about a decade 
I10ll21ll28l . as it can tackle CSPs instances far beyond the reach of 
classical propagation-based constraint solvers f71 1271 . A generic, 
domain-independent constraint-based local search method, named 
Adaptive Search, has been proposed by 1101 1111 . This meta- 
heuristic takes advantage of the structure of the problem in terms 
of constraints and variables and can guide the search more pre- 
cisely than a single global cost function to optimize, such as for 
instance the number of violated constraints. The algorithm also 
uses a short-term adaptive memory in the spirit of Tabu Search in 
order to prevent stagnation in local minima and loops. 

4.2 Adaptive Search 

An implementation of Adaptive Search (AS) has been developed in 
C language as a framework library and is available as a freeware at 

the URL: 

http : // cri-dist .univ-parisl . f r/diaz/adaptive/ 

We used this reference implementation for our experiments. 
The Adaptive Search method can be applied to a large class of 
constraints (e.g. linear and non-linear arithmetic constraints, sym- 
bolic constraints, etc.) and naturally copes with over-constrained 
problems [36|. The input of the method is a Constraint Satisfac- 
tion Problem (CSP for short), which is defined as a triple (X;D;C), 
where X is a set of variables, D is a set of domains, i.e., finite sets 
of possible values (one domain for each variable), and C a set of 
constraints restricting the values that the variables can simultane- 
ously take. For each constraint, an error function needs to be de- 
fined; it gives, for each tuple of variable values, an indication of 
how much the constraint is violated. This idea has also been pro- 
posed independently by |21 1, where it is called "penalty functions", 
and then reused by the Comet system 1 281, where it is called "vi- 
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olations". For example, the error function associated with an arith- 
metic constraint \X — Y\ < c, for a given constant c > 0, can be 
max(0, \X - Y\ - c). 

Adaptive Search relies on iterative repair, based on variable and 
constraint error information, seeking to reduce the error on the 
worst variable so far. The basic idea is to compute the error function 
for each constraint, then combine for each variable the errors of all 
constraints in which it appears, thereby projecting constraint errors 
onto the relevant variables. This combination of errors is problem- 
dependent, see 1 10] for details and examples, but it is usually a 
simple sum or a sum of absolute values, although it might also be 
a weighted sum if constraints are given different priorities. Finally, 
the variable with the highest error is designated as the "culprit" 
and its value is modified. In this second step, the well known min- 
conflict heuristic |32] is used to select the value in the variable 
domain which is the most promising, that is, the value for which 
the total error in the next configuration is minimal. 

In order to prevent being trapped in local minima, the Adap- 
tive Search method also includes a short-term memory mechanism 
to store configurations to avoid (variables can be marked Tabu and 
"frozen" for a number of iterations). It also integrates reset tran- 
sitions to escape stagnation around local minima. A reset consists 
in assigning fresh random values to some variables (also randomly 
chosen). A reset is guided by the number of variables being marked 
Tabu. It is also possible to restart from scratch when the number of 
iterations becomes too large (this can be viewed as a reset of all 
variables but it is guided by the number of iterations). The core 
ideas of adaptive search can be summarized as follow: 

• to consider for each constraint a heuristic function that is able 
to compute an approximated degree of satisfaction of the goals 
(the current error on the constraint); 

• to aggregate constraints on each variable and project the error 
on variables thus trying to repair the worst variable with the 
most promising value; 

• to keep a short-term memory of bad configurations to avoid 
looping {i.e. some sort of tabu list) together with a reset mech- 
anism. 

5. Benchmark Problems and Experimental 
Results 

We have chosen to test this method on two problems from the 
CSPLib benchmark library 1 22 1, and on a hard combinatorial prob- 
lem abstracted from radar and sonar applications. After briefly in- 
troducing the classical benchmarks, we detail the latter problem, 
called COSTAS ARRAY. Then we show the performance and the 
speed-ups obtained with both sequential and a multi-walk Adap- 
tive Search algorithm on these problems. 

We use two classical benchmarks from CSPLib consisting of: 

• The ALL-INTERVAL Series problem (prob007 in CSPLib), 

• The MAGIC-SQUARE problem (prob019 in CSPLib). 

Although these benchmarks are academic, they are abstractions 
of real-world problems and could involve very large combinato- 
rial search spaces, e.g., the 200x200 MAGIC-SQUARE problem 
requires 40,000 variables whose domains range over 40,000 val- 
ues. Indeed the search space in the Adaptive Search model (us- 
ing permutations) is 40, 000!, i.e., more than 10 166713 configura- 
tions. Classical propagation-based constraint solvers cannot solve 
this problem for instances higher than 20x20. Also note that we 
are tackling constraint satisfaction problems as optimization prob- 
lems, that is, we want to minimize the global error (representing 
the violation of constraints) to value zero, therefore finding a solu- 



tion means that we actually reach the bound (zero) of the objective 
function to minimize. 

5.1 The All-Interval Series Problem 



This problem is described as prob007 in the CSPLib. This bench- 
mark is in fact a well-known exercise in music |37] where the 
goal is to compose a sequence of N notes such that all are dif- 
ferent and tonal intervals between consecutive notes are also dis- 
tinct. This problem is equivalent to finding a permutation of the 
TV first integers such that the absolute difference between two con- 
secutive pairs of numbers are all different. This amounts to find- 
ing a permutation (AT, . . . Xn) of {0, . . . N — 1} such that the 
list (abs(Xi — X2), abs(X'2 — A3) . . . abs(Ajv_i — xn)) is a 
permutation of 1, . . . , N — 1. A possible solution for N = 8 is 
(3, 6, 0, 7, 2, 4, 5, 1) since all consecutive distances are different: 







5.2 The Magic-Square Problem 
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The Magic-Square problem 
is catalogued as prob019 in 
CSPLib and consists in plac- 
ing the numbers {1, 2 • • • N 2 } 
on an A x A square such that 
34 34 34 34 34 34 each row, column and main di- 
agonal equal the same sum (the 

constant N(N 2 + l)/2). 

For instance, this figure shows a well-known solution for N = 4 
(depicted by Albrecht Diirer in his engraving Melancholia /, 1514). 

5.3 The Costas Array Problem 



A Costas array is an N x N grid con- 
taining N marks such that there is ex- 
actly one mark per row and per col- 
umn and the N(N — l)/2 vectors join- 
ing the marks are all different. We give 
here an example of Costas array of size 
5. It is convenient to see the COSTAS 
ARRAY Problem (CAP) as a permu- 
tation problem by considering an ar- 
Vn) which forms a permutation of 



ray of N variables (Vi, 

{1,2,..., N}. The above Costas array can thus be represented by 
the array [3,4,2,1,5]. 

Historically these arrays have been developed in the 1960's to 
compute a set of sonar and radar frequencies avoiding noise 1121 . 
A very complete survey on Costas arrays can be found in 1171 . 
The problem of finding a Costas array of size N is very complex 
since the required time grows exponentially with N. In the 1980's, 
several algorithms have been proposed to build a Costas array given 
N (methods to produce Costas arrays of order 24 to 29 can be found 
in (6] [TU [TH [35)), such as the Welch construction (23) and the 
Golomb construction [24], but these methods cannot built Costas 
arrays of size 32 and some higher non-prime sizes. Nowadays, 
after many decades of research, it remains unknown if there exist 
any Costas arrays of size 32 or 33. Another difficult problem is 
to enumerate all Costas arrays for a given size. Using the Golomb 
and Welch constructions, Drakakis et. al present in 1191 all Costas 
arrays for N — 29. They show that among the 29! permutations, 
there are only 164 Costas arrays, and 23 unique Costas arrays up to 
rotation and reflection. 
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5.4 Sequential Results 

We run our benchmarks in a sequential manner in order to have 
about 650 runtimes for each. Sequential experiments, as well as 
parallel experiments, have been done on the Griffon cluster of 
the Grid' 5000 platform. The following Tables [T] and [2] shows the 
minimum, mean, median and maximum respectively among the 
runtimes and the number of iterations of our benchmarks. 



Problem 


Min 


Mean 


Median 


Max 


MS 200 
AI 700 
Costas 21 


5.51 
23.25 
6.55 


382.0 
1,354.0 
3,744.4 


126.3 
945.4 
2,457.4 


7,441.6 
10,243.4 
19,972.0 





Table 1. Sequential execution times (in seconds) 



Problem 


Min 


Mean 


Median 


Max 


MS 200 
AI 700 
Costas 21 


6,210 
1,217 
321,361 


443,969 
110,393 
183,428,617 


164,042 
76,242 
119,667,588 


7,895,872 
826,871 
977,709,115 



Table 2. Sequential number of iterations 



One can see that runtimes and the number of iterations, respec- 
tively from Tables [T] and [2] are spread over a large interval, illus- 
trating the stochasticity of the algorithm. Depending on the bench- 
mark, there is a ratio of a few thousands between the minimum and 
the maximum runtimes. 



not increase as fast as the number of cores, i.e., are getting further 
away from linear speed-up. This is visually depicted on Figure [6] 
For the COSTAS ARRAY Problem, the AS algorithm reaches linear 
or even supra-linear speed-ups up to 256 cores, as depicted in Fig- 
ure|7] Actually, it scales linearly far beyond this point, i.e., at least 
up to 8,192 cores, as reported in 1161 . 











Ideal 






MS 200 






AI700 





16 32 64 128 256 

number of cores 



Figure 6. Speed-ups for CSPLib benchmarks 



5.5 Parallel Results 

We have conduct parallel experiments on the Grid5000 plat- 
form 1 8 1, the French national grid for research, which contains 
8,596 cores deployed on 1 1 sites distributed in France. For our ex- 
periments, we used the Griffon cluster at Nancy, composed of 1 84 
Intel Xeon L5420 (Quad-core, 2.5GHz, 12MB of L2-cache, bus 
frequency at 1333MHz), thus with a total of 736 cores available 
giving a peak performances of 7.36TFlops. 

Tables [3] and [4] present the execution times and the number 
of iterations, respectively, with speed-ups for executions of large 
benchmarks MAGIC-SQUARE 200x200, ALL-INTERVAL with n 
= 700 and COSTAS ARRAY with n = 21, up to 256 cores. The 
same code has been ported and executed, timings are given in 
seconds and are the average of 50 runs. One can notice there are 
no significant differences between speed-ups of these two tables, 
therefore we will prefer as a time measurement the number of 
iterations, which has the good property of not being machine- 
dependent. Anyway, similar speed-ups have been achieved on other 
parallel machines (9). 



Problem 


time on 1 core 
(seconds) 


speed-up on k cores 


16 


32 


64 


128 


256 


MS 200 
AI 700 
Costas 21 


382.0 
1,354.0 
3,744.4 


18.3 
12.9 
15.7 


24.5 
19.3 
26.4 


32.3 
30.6 
59.8 


37.0 
39.2 
154.5 


47.8 
45.5 
274.8 


Table 3. Speed-ups with respect to sequential time 


Problem 


# of iterations 
on 1 core 


speed-up on k cores 


16 


32 


64 


128 


256 


MS 200 
AI 700 

Costas 21 


443,969 
110,393 
183,428,617 


16.6 
12.8 
15.8 


22.2 
20.2 
26.4 


29.9 
29.3 
60.0 


34.3 
37.3 
159.2 


45.0 
48.0 
290.5 



Table 4. Speed-ups with respect to sequential number of iterations 



For the two CSPLib benchmarks, one can observe the stabiliza- 
tion point is not yet obtained for 256 cores, even if speed-ups do 




number of cores 



Figure 7. Speed-ups for the COSTAS ARRAY Problem 



6. Prediction of Parallel Speed-ups 

On each problem, the sequential benchmark gives observations of 
the distribution of the algorithm runtime fy. Yet, the exact distri- 
bution is still unknown. It can be any real distribution, not even a 
classical one. In the following, we will rely on the assumption that 
Y is distributed with a known parametric distribution. We perform a 
statistical test, called Kolmogorov-Smirnov test, on the hypothesis 
Ho that the collected observations correspond to a theoretical dis- 
tribution. Assuming Ho, the test first computes the probability that 
the distance between the collected data and the theoretical distri- 
bution, does not significantly differ from its theoretical value. This 
probability is called the p- value. Then, the p- value is compared to 
a fixed threshhold (usually 0.05). If it is smaller, one rejects Ho- 
For us, it means that the observations do not correspond to the the- 
oretical distribution. If the p- value is high, we will consider that the 
distribution of Y is approximated by the theoretical one. Note that 
the Kolmogorov-Smirnov test is a statistical test, which in no way 
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Figure 8. Histogram of the observed number of iterations for 
720 runs on the ALL-INTERVAL series problem with TV = 700, 
in blue. In red, the corresponding shifted exponential distribution, 
statistically estimated. 




speed - up 



Figure 9. Predicted speed-up for AI 700 as a function of the 
number of cores (plain blue), with its limit (dashed yellow) and 
the ideal linear speed-up (dashed pink). 



probability 



proves that Y follows the distribution. However, it measures how 
well the observations fit a theoretical curve and, as it will be seen 
in the following, it is accurate enough for our purpose. 

The distributions tested for local search algorithms are the expo- 
nential distribution, as suggested by 1201 , and the lognormal distri- 
bution, because it appears to fit the MAGIC-SQUARE problem. We 
have also performed the Kolmogorov-Smirnov test on other distri- 
butions (e.g. gaussian and Levy), but obtained negative results w.r.t. 
the experimental benchmarks, thus we do not include them in the 
sequel. For each problem, we will need to estimate the value of 
the parameters of the distribution, which is done on a case by case 
basis. 

Once we have an estimated distribution for the runtimes of Y, 
it becomes possible to compute the expectation of the parallel run- 
times and the speed-up thanks to formulas of Section [3~2| All the 
mathematical calculations are done with Mathematica, a commer- 
cial software for symbolic computation |40|. 

In the following, all the analyses are done on the number of 
iterations, because they are more likely to be unbiased. 

6.1 The All-Interval Series Problem 

The analysis is done on 720 runs of the Adaptive Search algorithm 
on the instance of ALL-INTERVAL series for 700 notes. The se- 
quence of observations is written AI 700 in the following. 

We test the hypothesis that the observations admit a shifted 
exponential distribution as introduced in Section [33] The first step 
consists in estimating the parameter of the distribution^which for 
a shifted exponential are the value of the shift xo and )r \ We take 
for xo the minimum observed value, x — 1217. The exponential 
parameter is estimated thanks to the following relation: for a non- 
shifted exponential distribution, the expectation is 1/A. Thus we 
take A = l/(mean(AI 700) - x ), which gives A = 9.15956 * 
10" 6 . 

We then run the Kolmogorov-Smirnov test on the shifted expo- 
nential distribution with these values of xo and A, which answers 
positively (computed p-value: 0.77435). We thus admit the hypoth- 
esis that AI 700 fits this shifted exponential distribution. As an illus- 
tration, Figure [8] shows the normalized histogram of the observed 
runtimes and the theoretical distribution. 

It is then possible to symbolically compute the speed-up that 
can be expected with the parallel scheme described in Section [2~2| 
We use the formulas of Section [33] with the estimated parameters 



4. * 10 



2. * 10" 6 




4 * 10' 



6 * 10" 



8 * 10 6 



All the notations are the same as in section 







Figure 10. Histogram of the observed number of iterations for 662 
runs on the MAGIC-SQUARE problem with TV = 200, in blue. In 
red, the corresponding shifted lognormal distribution, statistically 
estimated. 



and obtain a theoretical expression for the speed-up. This allows us 
to calculate its value for different number of cores. 

The results are given on Figure [9] With this approximated dis- 
tribution, the limit of the speed-up when the number of cores tends 
to infinity is 90.7087. One can see that, with a 256 cores, the curve 
has not reached its limit, but comes close. Thus, the speed-up for 
this instance of ALL-INTERVAL appears significantly less than lin- 
ear (linear meaning: equal to the number of cores). 

6.2 The Magic-Square Series Problem 

For the MAGIC-SQUARE problem with N = 200, the observations 
are the number of iterations on 662 runs, with a minimum of 
Xo — 6210. The Kolmogorov-Smirnov test on a shifted exponential 
distribution fails, but we obtain a positive result with a lognormal 
distribution, with fx = 12.0275 and a = 1.3398, shifted to 
xo- These parameters have been estimated with the use of the 
Mathematica software. As an illustration, Figure [10] shows the 
observations and the theoretical estimated distribution. 

The speed-up can be computed by integrating the minimum 
distribution with numerical integration techniques. The results are 
presented on Figure [TT| We can observe that the speed-up grows 
very fast at the origin, which can be explained by the high peak of 
the lognormal distribution with these parameters. Again, the speed- 
up is computed with a numerical integration step, and we only draw 
the curve for integer values of n. In this case again, the speed-up is 
significantly less than linear from 50 cores onwards, and the limit 
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Figure 11. Predicted speed-up for MS 200 as a function of the 
numbers of cores (blue dots), with the ideal linear speed-up (dashed 
pink). 
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Figure 12. Histogram of the observed number of iterations for 
638 runs on the COSTAS ARRAY problem with TV = 21, in blue. In 
red, the corresponding shifted exponential distribution, statistically 
estimated. 



of the speed-up when the number of cores tends to infinity is about 
71.5. 



6.3 The Costas Array Problem 

The same analysis is done for the runs of the AS algorithm on 
the COSTAS Array problem with N = 21. The observations 
are taken from the benchmark with 638 runs. The sequence of 
observations is written Costas 21. 

This benchmark has an interesting property: the observed 
minimum, 3.2 * 10 5 is neglictible compared to its mean (1.8 * 
10 8 ). Thus, we estimate xo — and perform a Kolmogorov- 
Smirnov test for a (non-shifted) exponential distribution, with 
A = l/mean(Costas 21) = 5.4 * 10 -9 . The test is positive for 
this exponential distribution, with a p- value of 0.751915. Figure[T2] 
shows the estimated distribution compared to the observations. 

The computation of the theoretical speed-up is also done in the 
same way as for AI 700. Yet, in this case, the observed minimum for 
xo is so small that we can approximate the observations with a non- 
shifted distribution, thus the predicted speed-up is strictly linear, 
as shown in Section [373] The results are given on Figure [T3] This 
explains that one may observe linear speed-up when parallelizing 
Costas Array. 




Figure 13. Predicted speed-up for Costas 21 as a function of the 
number of cores. 



7. Analysis 

Table|5]presents the comparison between the predicted and the ex- 
perimental speedups. We can see that the accuracy of the prediction 
is very good up to 64 parallel cores and then the divergence is lim- 
ited even for 256 parallel cores. 

For the MS 200 problem, the experimental speed-up and the 
predicted one are almost identical up to 128 cores and diverging 
by 10% for 256 cores. For the AI 700 problem, the experimental 
speed-up is less good than the predicted one by a maximum of 30% 
for 128 and 256 cores. For the Costas 21 problem, the experimental 
speed-up is better than the predicted one by 15% for 128 and 256 
cores. 



Problem 






speed-up on k 


cores 




16 


32 


64 


128 


256 


MS 200 


experimental 


16.6 


22.2 


29.9 


34.3 


45.0 




predicted 


15.94 


22.04 


28.28 


34.26 


39.7 


AI 700 


experimental 


12.8 


20.2 


29.3 


37.3 


48.0 




predicted 


13.7 


23.8 


37.8 


53.3 


67.2 


Costas 21 


experimental 


15.8 


26.4 


60.0 


159.2 


290.5 




predicted 


16.0 


32.0 


64.0 


128.0 


256.0 



Table 5. Comparison: experimental and predicted speedups 

It is worth noticing that our model approximates the behav- 
iors of experimental results very closely, as shown by the predicted 
speed-ups matching closely the real ones. Moreover we can see that 
on the three benchmark programs, we needed to use three different 
types of distribution (exponential, shifted exponential and lognor- 
mal), in order to approximate the experimental data most closely. 
This shows that our model is quite general and can accommodate 
different types of parallel behaviors. 

A quite interesting behavior is exhibited by the Costas 21 prob- 
lem. Our model predicts a linear speedup, up to 10,000 cores and 
beyond, and the experimental data gathered for this paper confirms 
this linear speed-up up to 256 cores. Would it scale up with a larger 
number of cores? Indeed such an experiment has been done up to 
8,192 cores on the JUGENE IBM Bluegene/P at the Jiilich Super- 
computing Center in Germany (with a total 294,912 cores), and 
reported in 1161 , of which Figure|7]is adapted. We can see that the 
speed-up is indeed linear up to 8,192 cores, thus showing the ade- 
quation of the prediction model with the real data. 

Finally, let us note that our method exhibits an interesting phe- 
nomenon. For the three problems considered, the probability of re- 
turning a solution in no iterations is non-null: since they start by a 
uniform random draw on the search space, there is a very small, but 
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number of cores 

Figure 14. Speed-ups for Costas 21 up to 8,192 cores 



not null, probability that this random initialization directly returns 
the solution. Hence, in theory, xo = and the speed-up should be 
linear, with an infinite limit when the number of cores tends to in- 
finity. Intuitively, if the number of cores tends to infinity, at some 
point it will be large compared to the size of the search space and 
one of the cores is likely to immediately find the solution. 

Yet, in practice, observations shows that the experimental 
curves may be better approximated by a shifted exponential with 
xo > 0, as it is the case for AI 700. With an exponential distribu- 
tion, this leads to non-linear speed-up with a finite limit. Indeed, 
the experimental speed-up for AI 700 is far from linear. On the con- 
trary, Costas 21 has a linear speed-up due to its xo << 1/A, which 
makes the statistical test succeed for xo — 0. Firstly, this suggests 
that the comparison between xo and 1/A on a number of observa- 
tions is a key element for the parallel behavior. It also means that 
the number of observations needed to properly approximate the 
sequential distribution probably depends on the problem. 

8. Conclusion 

We have proposed a theoretical model for predicting and analyz- 
ing the speed-ups of Las Vegas algorithms. It is worth noticing that 
our model mimics the behaviors of the experimental results very 
closely, as shown by the predicted speedups matching closely the 
real ones. Our practical experiments consisted in testing the accu- 
racy of the model with respect to three instances of a local search al- 
gorithm for combinatorial optimization problems. We showed that 
the parallel speed-ups predicted by our statistical model are accu- 
rate, matching the actual speed-ups very well up to 64 parallel cores 
and then with a deviation of about 10%, 15% or 30% (depending 
on the benchmark problem) up to 256 cores. 

However, one limitation of our approach is that, in practice, we 
need to be able to compute the expectation of the minimum dis- 
tribution. Nevertheless, apart from the exponential distribution for 
which this computation is easy, recent results in the field of or- 
der statistics gives explicit formulas for a number of useful distri- 
butions: gaussian, lognormal, gamma, beta. This provides a wide 
range of tools to analyze different behaviors. In this paper we val- 
idated our approach on classical combinatorial optimization and 
CSP benchmarks, but further research will consider a larger class 
of problems and algorithms, such as SAT solvers and other random- 
ized algorithms (e.g. quick sort). 

Another interesting extension of this work would be to devise 
a method for predicting the speed-up from scratch, that is, without 
any knowledge on the algorithm distribution. Preliminary observa- 
tion suggests that, given a problem and an algorithm, the general 
shape of the distribution is the same when the size of the instances 



varies. For example, the different instances of ALL-INTERVAL that 
we tested all admit a shifted exponential distribution. If this prop- 
erty is valid on a wide range of problems/algorithms, then we can 
develop a method for predicting the speed-up for large instances by 
learning the distribution shape on small instances (which are eas- 
ier to solve), and then estimating the parallel speed-up for larger 
instances with our model. 
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